RNASeq
National Facility for Data Handling and Analysis - Human Technopole
This report includes the results of the RNASeq analysis pipeline developed by the National Facility for Data Handling and Analysis based on the nf-core/rnaseq pipeline, with some modifications to suit our needs. This pipeline performs quality control, alignment, quantification, and differential expression analysis of bulk transcriptomics data.
This report has been generated by the nfdata-omics/rnaseq analysis pipeline.
/scratch/camilla.callierotti/nextflow/96/4eaa7aa2175d2c0935931c96dedf4c
Sample-to-sample Correlation
Sample-to-sample Pearson correlation is calculated from the CPM (counts per million) values of all expressed genes. The resulting heatmap shows the pairwise correlation coefficients, with red indicating higher correlation and blue indicating lower correlation.
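As a rough illustration of the calculation described above, the following sketch converts raw counts to CPM and computes pairwise Pearson correlation between samples. The sample names and counts are invented toy values, and the helper functions are hypothetical; the pipeline's own implementation (for example, whether it log-transforms CPM first) is not stated in this report and may differ.

```python
import math

def cpm(counts):
    """Scale each sample's raw counts to counts per million (CPM)."""
    scaled = {}
    for sample, values in counts.items():
        library_size = sum(values)
        scaled[sample] = [v * 1e6 / library_size for v in values]
    return scaled

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(values):
    """Pairwise sample-to-sample correlation as a nested dict."""
    samples = list(values)
    return {a: {b: pearson(values[a], values[b]) for b in samples}
            for a in samples}

# Toy raw counts (4 genes per sample); names and numbers are illustrative only.
toy_counts = {
    "S1": [10, 20, 30, 40],
    "S2": [20, 40, 60, 80],   # same proportions as S1, deeper sequencing
    "S3": [40, 30, 20, 10],   # reversed expression profile
}
corr = correlation_matrix(cpm(toy_counts))
```

In this toy example S1 and S2 differ only in sequencing depth, so their CPM profiles coincide and their correlation is 1, while S3 is perfectly anti-correlated with S1.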
Dimensionality Reduction
Principal Component Analysis (PCA) is a dimensionality reduction algorithm used to summarize the variability structure of an entire dataset into a few latent variables, called principal components.
Each component is defined as a linear combination of the original features (for example, genes) and is orthogonal to (and thus uncorrelated with) all previous components.
Components are defined as eigenvectors of the covariance matrix of the data, with the first component explaining the largest percentage of variance and subsequent components accounting for progressively smaller proportions of variance.
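The eigenvector definition above can be sketched on a tiny example. The code below centers a toy matrix, builds its covariance matrix, and extracts the leading eigenvector by power iteration; this is only one simple way to obtain the first component and is not necessarily what the pipeline uses. All data values and function names are illustrative assumptions.

```python
import math

def center_columns(rows):
    """Subtract each feature's mean so every column has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance (features x features) of already-centered rows."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    p = len(matrix)
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    mv = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(vec, mv))  # Rayleigh quotient
    return eigenvalue, vec

# Toy data: 4 samples (rows) x 2 features (columns), roughly on the line y = 2x.
toy = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
centered = center_columns(toy)
cov = covariance_matrix(centered)
eigenvalue, pc1 = leading_eigenpair(cov)

# Fraction of total variance explained by the first component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_fraction = eigenvalue / total_variance

# Sample coordinates along the first principal component.
pc1_scores = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
```

Because the toy points lie almost exactly on a line, the first component points along that line and accounts for nearly all of the variance.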
Multidimensional Scaling (MDS) is also a dimensionality reduction algorithm that seeks to preserve the pairwise distances between samples when projecting data into a lower-dimensional space.
Unlike PCA, which identifies linear combinations of features to maximize variance, MDS is more flexible and works directly with a distance matrix, making it suitable for data where the relationships are non-linear or not well captured by variance.
If the relationships between the variables in the dataset are primarily linear and the variance structure is well-defined, results from the two algorithms are likely to converge.
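One way to see why the two methods can agree is sketched below, assuming classical (metric) MDS on Euclidean distances; the report does not state which MDS variant or distance the pipeline uses. In that setting, the double-centered matrix of squared distances that MDS decomposes is identical to the Gram matrix of the centered data, whose eigen-decomposition yields the PCA sample scores. The data and function names are illustrative.

```python
def squared_distances(rows):
    """Pairwise squared Euclidean distances between samples (rows)."""
    return [[sum((a - b) ** 2 for a, b in zip(r, s)) for s in rows]
            for r in rows]

def double_center(d2):
    """Classical MDS step: B = -1/2 * J * D^2 * J, with J the centering matrix."""
    n = len(d2)
    row_mean = [sum(row) / n for row in d2]
    col_mean = [sum(d2[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def gram_of_centered(rows):
    """Sample-by-sample inner products after centering each feature."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(a * b for a, b in zip(r, s)) for s in centered]
            for r in centered]

# Toy data: 4 samples (rows) x 3 features (columns); values are illustrative.
samples = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.7], [4.0, 8.0, 0.6]]

b_matrix = double_center(squared_distances(samples))
gram = gram_of_centered(samples)

# Largest element-wise difference between the two matrices (numerically zero).
n_samples = len(samples)
max_diff = max(abs(b_matrix[i][j] - gram[i][j])
               for i in range(n_samples) for j in range(n_samples))
```

With a non-Euclidean distance the identity no longer holds, which is where MDS and PCA results can diverge.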
In a PCA/MDS scatterplot, each dot represents a sample; dots that are closer together indicate more similar samples, while distant dots suggest highly divergent samples or potential outliers.
The coordinates of each dot on the plot are calculated by the algorithms in an unsupervised manner.
However, dots can be color-coded based on sample characteristics chosen by the users, enabling the identification of whether clusters of samples are associated with specific experimental or technical variables.
Both analyses were performed on normalized expression values of the top 5000 most variable genes among the expressed ones. The data were preprocessed by centering and scaling to achieve a zero mean and unit variance.
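The preprocessing described above can be sketched as follows: rank genes by variance across samples, keep the most variable ones, and standardize each retained gene. The toy matrix keeps the top 2 genes instead of 5000, and the gene names, values, and the choice of per-gene standardization are assumptions made for illustration only.

```python
import statistics

def top_variable_genes(expression, n_top):
    """Return the n_top gene names with the highest variance across samples."""
    ranked = sorted(expression,
                    key=lambda gene: statistics.variance(expression[gene]),
                    reverse=True)
    return ranked[:n_top]

def standardize(values):
    """Center to zero mean and scale to unit (sample) variance."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Toy normalized expression: gene -> values across 4 samples (illustrative).
expression = {
    "geneA": [1.0, 1.0, 1.0, 1.1],    # nearly constant, low variance
    "geneB": [1.0, 5.0, 9.0, 13.0],   # highest variance
    "geneC": [2.0, 4.0, 6.0, 8.0],
}

selected = top_variable_genes(expression, n_top=2)
scaled = {gene: standardize(expression[gene]) for gene in selected}
```

Selecting by variance before scaling matters here: after standardization every gene has unit variance, so the ranking must be done on the unscaled values.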
The following interactive plots display the distribution of samples in the reduced component spaces and can be used to highlight different sample characteristics as reported in the metadata.
PCA Scatter Plot with Metadata
Interactive PCA scatter plot with metadata filtering options.
MDS Scatter Plot with Metadata
Interactive MDS scatter plot with metadata filtering options.
Enrichment analysis
Over-representation analysis (ORA) and gene set enrichment analysis (GSEA).
Over-representation and GSEA Results
Interactive table showing over-representation and GSEA results for each dataset and comparison. Click on any row to expand it and view the corresponding enrichment dot plots, with toggles for up-regulated, down-regulated, or all regulated genes. Dot size represents effect size, and color represents significance.
| Dataset | Comparisons | Data Summary |
|---|---|---|
| GSE52778_raw_counts_GRCh38.p13_NCBI | Treatment:Albuterol:Untreated, Treatment:Albute... | Over-representation: 3/3; GSEA: 0/3 |

| Comparison | Over-representation | GSEA |
|---|---|---|
| Treatment:Albuterol:Untreated | ✓ | ✗ |
| Treatment:Albuterol_Dexamethasone:Untreated | ✓ | ✗ |
| Treatment:Dexamethasone:Untreated | ✓ | ✗ |
Software Versions
Software Versions lists versions of software tools extracted from file contents.
| Group | Software | Version |
|---|---|---|
| CLUSTERPROFILER_ORA | R | 4.4.2 |
| | clusterProfiler | 4.14.4 |
| | enrichplot | 1.26.6 |
| | msigdbr | 7.5.1 |
| | openxlsx | 4.2.8 |
| | optparse | 1.7.5 |
| CONTROL_GENE_HEATMAP | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | ggplot2 | 3.5.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| | pheatmap | 1.0.12 |
| | stringr | 1.5.1 |
| CUSTOM_GETCHROMSIZES | getchromsizes | 1.21 |
| DESEQ2_COMPARE | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | DESeq2 | 1.46.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | ggplot2 | 3.5.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| | stringr | 1.5.1 |
| DESEQ2_FIT | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | DESeq2 | 1.46.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | edgeR | 4.4.0 |
| | limma | 3.62.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| | stringr | 1.5.1 |
| GENEID_TO_GENENAME | mawk | null |
| | sed | null |
| GSEA | R | 4.4.2 |
| | clusterProfiler | 4.14.4 |
| | enrichplot | 1.26.6 |
| | openxlsx | 4.2.8 |
| | optparse | 1.7.5 |
| GSEA_MERGE | R | 4.4.2 |
| | openxlsx | 4.2.7.1 |
| | optparse | 1.7.5 |
| | stringr | 1.5.1 |
| GTF2BED | perl | 5.26.2 |
| GTF_FILTER | python | 3.9.5 |
| GUNZIP_FASTA | gunzip | 1.1 |
| GUNZIP_GTF | gunzip | 1.1 |
| GUNZIP_TRANSCRIPT_FASTA | gunzip | 1.1 |
| OBJ_CONSTRUCTION | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| PCA_AND_MDS | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | edgeR | 4.4.0 |
| | ggplot2 | 3.5.1 |
| | limma | 3.62.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| R_COUNT_NORM | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | edgeR | 4.4.0 |
| | limma | 3.62.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| SALMON_INDEX | salmon | 1.10.3 |
| SAMPLES_CORRELATION | Biobase | 2.66.0 |
| | BiocGenerics | 0.52.0 |
| | GenomeInfoDb | 1.42.0 |
| | GenomicRanges | 1.58.0 |
| | IRanges | 2.40.0 |
| | MatrixGenerics | 1.18.0 |
| | R | 4.4.2 |
| | S4Vectors | 0.44.0 |
| | SummarizedExperiment | 1.36.0 |
| | ggplot2 | 3.5.1 |
| | matrixStats | 1.4.1 |
| | optparse | 1.7.5 |
| | pheatmap | 1.0.12 |
| | stringr | 1.5.1 |
| STAR_GENOMEGENERATE | gawk | 5.1.0 |
| | samtools | 1.2 |
| | star | 2.7.11b |
| UCSC_GTFTOGENEPRED | ucsc | 447 |
| Workflow | Nextflow | 24.04.4 |
| | nfdata-omics/rnaseq | v0.0.0dev-gdd86bb6 |
nfdata-omics/rnaseq Methods Description
Suggested text and references to use when describing pipeline usage within the methods section of a publication. URL: https://github.com/nfdata-omics/rnaseq
Methods
Data were processed using nfdata-omics/rnaseq v0.0.0dev of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
The pipeline was executed with Nextflow v24.04.4 (Di Tommaso et al., 2017) with the following command:
nextflow run nfdata-omics/rnaseq -r custom_multiqc_test -hub ht_gitlab -params-file nf-params.yml -profile ht_cluster -ansi-log false -resume
References
- Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. doi: 10.1038/nbt.3820
- Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276–278. doi: 10.1038/s41587-020-0439-x
- Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Köster, J., & Bioconda Team. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476. doi: 10.1038/s41592-018-0046-7
- da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580–2582. doi: 10.1093/bioinformatics/btx192
Notes:
- If available, make sure to update the text to include the Zenodo DOI of the version of the pipeline used.
- The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!
- You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.
nfdata-omics/rnaseq Workflow Summary
This information is collected when the pipeline is started. URL: https://github.com/nfdata-omics/rnaseq
Input/output options
- input: /facility/nfdata-omics/projects/IU2_024_SORT1-KO-Thyroid_Coscia/results/20251114/sample_sheet.csv
- metadata: /home/camilla.callierotti/rnaseq_agent_demo/metadata.csv
- outdir: output
Count matrix options
- counts: /home/camilla.callierotti/rnaseq_agent_demo/GSE52778_raw_counts_GRCh38.p13_NCBI.tsv
Reference genome options
- fasta: /facility/nfdata-omics/reference/human/gencode/v44/GRCh38.p14.genome.fa.gz
- gtf: /facility/nfdata-omics/reference/human/gencode/v44/gencode.v44.annotation.gtf.gz
- transcript_fasta: /facility/nfdata-omics/reference/human/gencode/v44/gencode.v44.transcripts.fa.gz
Dimensionality reduction and DEA
- comparisons: Treatment:Dexamethasone:Untreated,Treatment:Albuterol:Untreated,Treatment:Albuterol_Dexamethasone:Untreated,Treatment:Albuterol_Dexamethasone:Dexamethasone
- control_genes_list: /home/camilla.callierotti/rnaseq_agent_demo/ctrl_genes.txt
- fdr_pathways: 0.1
- frac_expressed: 0.25
- genesets: /facility/nfdata-omics/reference/human/MSigDB/v2024.1.Hs/c2.all.v2024.1.Hs.entrez.gmt,/facility/nfdata-omics/reference/human/MSigDB/v2024.1.Hs/c5.all.v2024.1.Hs.entrez.gmt,/facility/nfdata-omics/reference/human/MSigDB/v2024.1.Hs/h.all.v2024.1.Hs.entrez.gmt
- lfc_threshold: 0.0
- model_formula: ~0+Treatment
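To make the interplay between the model formula and the comparisons concrete, the sketch below builds a no-intercept design matrix (one indicator column per treatment level, as `~0+Treatment` implies) and turns one comparison string into a contrast vector. Reading a comparison as `variable:numerator:denominator` is an assumption inferred from the comparison names, the sample labels are invented, and the helper functions are hypothetical rather than part of the pipeline.

```python
def design_matrix(treatments):
    """One indicator column per treatment level, with no intercept column."""
    levels = sorted(set(treatments))
    rows = [[1 if t == level else 0 for level in levels] for t in treatments]
    return levels, rows

def contrast_vector(comparison, levels):
    """Assumed format 'variable:numerator:denominator' -> +1 / -1 weights."""
    _variable, numerator, denominator = comparison.split(":")
    return [1 if level == numerator else -1 if level == denominator else 0
            for level in levels]

# Invented per-sample treatment labels, one entry per sample.
treatments = ["Untreated", "Dexamethasone", "Albuterol",
              "Albuterol_Dexamethasone", "Untreated", "Dexamethasone"]

levels, design = design_matrix(treatments)
contrast = contrast_vector("Treatment:Dexamethasone:Untreated", levels)
```

With a no-intercept design each coefficient estimates the mean of one treatment group, so every comparison reduces to a difference between two coefficients.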
Core Nextflow options
- configFiles: N/A
- containerEngine: singularity
- launchDir: /home/camilla.callierotti/rnaseq_agent_demo
- profile: ht_cluster
- projectDir: /home/camilla.callierotti/.nextflow/assets/nfdata-omics/rnaseq
- revision: custom_multiqc_test
- runName: hopeful_koch
- userName: camilla.callierotti
- workDir: /scratch/camilla.callierotti/nextflow